In a recent analyisis we reviewed the performance of every National Team in the world. However evolution of the style of game is a fact that was not contemplated. So, in this post we can visualize how in FIFA World Cup every nation is evolving at different rates.

library(tidyverse)
library(plotly)
library(lubridate)

Dataset

The first thing is to read files. I downloaded this project at 2021-07-22 from Kaggle.

results <- read.csv("results.csv", encoding = "UTF-8")

This dataset contains data about \(42k+\) football matches in the history of international encounters between national teams. So, let’s take a little taste of the data:

head(results)

Analysis

One interesting thing is to take a look at the context of the matches, some of them could be not relevant at all, however, there is also World cup matches, continental tournaments, and so on:

levels(as.factor(results$tournament)) -> tournaments
sample(tournaments,20)
##  [1] "South Pacific Games"                       
##  [2] "USA Cup"                                   
##  [3] "Nations Cup"                               
##  [4] "Vietnam Independence Cup"                  
##  [5] "CONCACAF Nations League"                   
##  [6] "CFU Caribbean Cup"                         
##  [7] "FIFA Arab Cup qualification"               
##  [8] "United Arab Emirates Friendship Tournament"
##  [9] "International Cup"                         
## [10] "Nordic Championship"                       
## [11] "World Unity Cup"                           
## [12] "EAFF Championship"                         
## [13] "WAFF Championship"                         
## [14] "FIFA World Cup qualification"              
## [15] "Gold Cup qualification"                    
## [16] "AFC Challenge Cup"                         
## [17] "Pan American Championship"                 
## [18] "Oceania Nations Cup qualification"         
## [19] "CONCACAF Nations League qualification"     
## [20] "Dynasty Cup"

Filtering by tournaments with at least 100 matches played in the history:

results %>%
  group_by(tournament) %>%
  summarise(count=n()) %>%
  filter(count > 100) %>%
  select(tournament) -> popularCups

results %>%
  filter(tournament %in% popularCups$tournament) %>%
  ggplot(aes(x=tournament, fill=tournament)) +
  geom_bar() +
  coord_flip() +
  labs(title="Matches in tournaments") -> p 
ggplotly(p)

Now we need to process a little bit of the data to assign a standard way to provide points based on the outcome of every match:

Points Outcome
\(3\) Victory
\(1\) Tie
\(0\) Defeat

In FIFA scores, 2 points can be achieved by winning a shootout after a tied match, however, I ignored that for the following analysis

Let’s take a look on how it looks now:

results %>%
  mutate(tied=ifelse(home_score == away_score,TRUE,FALSE)) %>%
  mutate(home_points=ifelse(tied == TRUE,1,ifelse(home_score > away_score,3,0))) %>%
  mutate(away_points=ifelse(tied == TRUE,1,ifelse(home_score > away_score,0,3))) -> results

results %>%
  filter(grepl("FIFA World Cup",tournament)) -> worldCupResults
head(worldCupResults)

After this step we also need to transform a little bit the structure of this dataset in order to measure the performance of each National Team in this way:

Then we can see how it looks (for tournaments that contain "FIFA World Cup" in its name).

results %>%
  pivot_longer(c(home_team,away_team),names_to = "homeaway", values_to = "team") %>%
  mutate(points=ifelse(grepl("home",homeaway),home_points,away_points),
         goals=ifelse(grepl("home",homeaway),home_score,away_score),
         receivedGoals=ifelse(grepl("home",homeaway),away_score,home_score)) %>%
  select(date,tournament,country,team,points,goals,receivedGoals) -> results

results %>%
  filter(grepl("FIFA World Cup",tournament)) -> worldCupResults

FIFA World Cup (and qualifiers)

The most interesting matches occur at FIFA World Cup. So we can focus on what happens in this tournament:

worldCupResults %>% filter(!grepl("qualifi",tournament)) %>% mutate(yr=year(date)) %>% group_by(yr,team) %>% summarise( p=sum(points), goals=sum(goals), against=sum(receivedGoals),matches=n()) %>% mutate( performance=p/matches, ofensive=goals/matches, defense=against/matches) %>% ggplot(aes(x=yr, y=performance, fill=team)) + geom_bar(stat="identity") -> p
## `summarise()` has grouped output by 'yr'. You can override using the `.groups` argument.
ggplotly(p)

Germany emerges as the best in performance over all the matches related to the World Cup. Is not a surprise at all, remember all of the “goleadas” that has produced, in the qualifiers as well as in the knock-out matches in the final stages of the tournament.

Now we can take a look at what happens if we focus only on the final stage, I mean filtering out the qualifiers:

worldCupResults %>% filter(!grepl("qualifi",tournament)) %>% mutate(yr=year(date)) %>% group_by(yr,team) %>% summarise( p=sum(points), goals=sum(goals), against=sum(receivedGoals),matches=n()) %>% mutate( performance=p/matches, ofensive=goals/matches, defense=against/matches) %>% filter(team %in% c("Mexico","Brazil","Argentina","Germany","France")) %>% ggplot(aes(x=yr, y=performance, color=team)) + geom_line() -> p
## `summarise()` has grouped output by 'yr'. You can override using the `.groups` argument.
ggplotly(p)
worldCupResults %>% filter(!grepl("qualifi",tournament)) %>% mutate(yr=year(date)) %>% group_by(yr,team) %>% summarise( p=sum(points), goals=sum(goals), against=sum(receivedGoals),matches=n()) %>% mutate( performance=p/matches, ofensive=goals/matches, defense=against/matches) %>% mutate(differenceGoal=ofensive-defense) %>% ggplot(aes(x=yr, color=team, y=differenceGoal)) + geom_line() -> p
## `summarise()` has grouped output by 'yr'. You can override using the `.groups` argument.
ggplotly(p)